OCR Exploration Server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Real-Word Typo Detection

Internal identifier: 000685 (Main/Exploration); previous: 000684; next: 000686

Authors: Dmitri Asonov [Russia]

RBID : ISTEX:B0626485B0ADCDBBA2FF81C856042E8AC4A377AF

Abstract

Context-sensitive spelling correction (CSSC) is a widely accepted and long-studied formalization of the problem of finding and fixing contextually incorrect words. We argue that CSSC has limitations as a model, and propose a weakened CSSC model, real-word typo detection (RWTD), to partially counter these limitations. We weaken the CSSC model by removing its word-correction role. Thus, RWTD focuses solely on finding words that require correction. Once this is done, the actual correction is performed by a human or by a CSSC solution. We propose a preliminary solution for the RWTD model that differs from related CSSC work in several ways. The solution does not rely on a set of confusion lists, and it detects not just a limited set of confusion typos but almost any class of typos. It offers a flexible trade-off between the time a human is willing to spend on the task and the quality of the proofreading. It does not require POS tagging and may be applied seamlessly to different languages. Experiment running times prove to be acceptable for real-world applications. We report real-word typos in the Brown corpus that were exposed by implementing our solution. We also discuss experiments in applying the solution to other real-world test texts and demonstrate improved false positive and hit rates.

DOI: 10.1007/978-3-642-12550-8_10


The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Real-Word Typo Detection</title>
<author>
<name sortKey="Asonov, Dmitri" sort="Asonov, Dmitri" uniqKey="Asonov D" first="Dmitri" last="Asonov">Dmitri Asonov</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:B0626485B0ADCDBBA2FF81C856042E8AC4A377AF</idno>
<date when="2010" year="2010">2010</date>
<idno type="doi">10.1007/978-3-642-12550-8_10</idno>
<idno type="url">https://api.istex.fr/document/B0626485B0ADCDBBA2FF81C856042E8AC4A377AF/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000512</idno>
<idno type="wicri:Area/Istex/Curation">000505</idno>
<idno type="wicri:Area/Istex/Checkpoint">000265</idno>
<idno type="wicri:doubleKey">0302-9743:2010:Asonov D:real:word:typo</idno>
<idno type="wicri:Area/Main/Merge">000690</idno>
<idno type="wicri:Area/Main/Curation">000685</idno>
<idno type="wicri:Area/Main/Exploration">000685</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Real-Word Typo Detection</title>
<author>
<name sortKey="Asonov, Dmitri" sort="Asonov, Dmitri" uniqKey="Asonov D" first="Dmitri" last="Asonov">Dmitri Asonov</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Russie</country>
<wicri:regionArea>Moscow</wicri:regionArea>
</affiliation>
<affiliation>
<wicri:noCountry code="no comma">E-mail: asonov@fastpl.com</wicri:noCountry>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2010</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">B0626485B0ADCDBBA2FF81C856042E8AC4A377AF</idno>
<idno type="DOI">10.1007/978-3-642-12550-8_10</idno>
<idno type="ChapterID">10</idno>
<idno type="ChapterID">Chap10</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: Context-sensitive spelling correction (CSSC) is a widely accepted and long studied formalization of the problem of finding and fixing contextually incorrect words. We argue that CSSC has its limitations as a model, and propose a weakened CSSC model (RWTD) to partially counter these limitations. We weaken the CSSC model by canceling its word-correction role. Thus, RWTD is focused solely on finding words that require correction. Once this is done, the actual correction process is performed by a human or a CSSC solution. We propose a preliminary solution for RWTD model that differs from related CSSC work in several ways. The solution does not rely on a set of confusion lists and detects not only a limited set of confusion typos, but almost any class of typos. The solution offers a flexible trade-off between the time a human is willing to spend on the task and the quality of the proofreading. It does not require POS tagging and may be applied seamlessly to different languages. Experiment running times prove to be acceptable for real-world applications. We report Brown corpus real-word typos that were exposed by implementing our solution. We also discuss experiments in applying the solution to other real-world test texts and demonstrate improved false positive and hit rates.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Russie</li>
</country>
</list>
<tree>
<country name="Russie">
<noRegion>
<name sortKey="Asonov, Dmitri" sort="Asonov, Dmitri" uniqKey="Asonov D" first="Dmitri" last="Asonov">Dmitri Asonov</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000685 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000685 | SxmlIndent | more
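
For readers without a Dilib installation, the record shown above can also be inspected with general-purpose tools. The following is a minimal Python sketch, not part of Dilib, that extracts a few fields from a trimmed copy of the record. It makes several assumptions: the function name `parse_record` and the namespace URI `urn:example:wicri` are hypothetical, and the namespace declaration is injected only because the record uses the `wicri:` prefix without declaring it, which a strict XML parser would otherwise reject.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record above; only a few fields are kept for illustration.
RECORD = """<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Real-Word Typo Detection</title>
<author>
<name sortKey="Asonov, Dmitri" first="Dmitri" last="Asonov">Dmitri Asonov</name>
</author>
</titleStmt>
<publicationStmt>
<date when="2010" year="2010">2010</date>
<idno type="doi">10.1007/978-3-642-12550-8_10</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

# Hypothetical placeholder URI (an assumption): the record never declares
# the "wicri:" prefix, so one is bound here to keep the parser happy.
WICRI_NS = "urn:example:wicri"


def parse_record(xml_text: str) -> dict:
    """Return title, authors, year and DOI from a Wicri-style record."""
    # Bind the undeclared prefix on the root element (first occurrence only).
    declared = xml_text.replace(
        "<record>", f'<record xmlns:wicri="{WICRI_NS}">', 1
    )
    root = ET.fromstring(declared)
    title_stmt = root.find(".//titleStmt")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [name.text for name in title_stmt.findall("author/name")],
        "year": root.findtext(".//publicationStmt/date"),
        "doi": root.findtext(".//idno[@type='doi']"),
    }


info = parse_record(RECORD)
print(info)
```

The same function should work on the full record, since it only relies on the `titleStmt` and `publicationStmt` elements; the output of `HfdSelect` could be passed to it as a string, assuming it has the same shape as the record displayed on this page.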

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:B0626485B0ADCDBBA2FF81C856042E8AC4A377AF
   |texte=   Real-Word Typo Detection
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024